随机森林¶

作业要求:• 选择一个公开数据集 •使用随机森林进行分类(二分类,多分类) • 设置测试集 •研究抽样比例对最终结果的影响 •研究决策树数目对最终结果的影响 • 了解你的数据 •数据是否均衡? • 抽样比例与决策树数目之间有最优组合吗? •实验中是否出现了重复的树?即总是给出同样结果的树。 • 如何确定单一的决策树出现过拟合? •随机森林避免了过拟合?

  • 选择公开数据集做分类(支持二分类或多分类)
  • 使用随机森林进行实验
  • 研究以下几个因素的影响:
    • 抽样比例(每棵树训练时用多少样本)
    • 决策树数量(森林里有多少棵树)
    • 是否存在重复树(是否有树结构完全相同)
    • 单棵树是否过拟合,随机森林是否避免了过拟合
  • 分析数据是否均衡,探索抽样比例与树数的“最优组合”

加载数据集¶

In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

# 直接从本地文件加载
df = pd.read_csv('bank-additional-full.csv', sep=';')
X = df.drop('y', axis=1)
y = df['y']

# 定义类别名称
class_names = ['no', 'yes']

# 查看数据基本信息
print("数据集形状:", X.shape)
print("\n特征名称:", X.columns.tolist())
print("\n目标类别:", class_names)
print("\n数据集前5行:")
print(X.head())

# 查看数据集的类别分布
plt.figure(figsize=(8, 4))
sns.countplot(x=y)
plt.title('Bank Marketing Target Distribution')
plt.show()

# 查看部分重要特征之间的关系
selected_features = X.select_dtypes(include=['int64','float64']).columns[:4] # 选择前4个数值型特征
plot_data = pd.concat([X[selected_features], y], axis=1)
plt.figure(figsize=(10, 6))
sns.pairplot(plot_data, hue='y')
plt.show()
数据集形状: (41188, 20)

特征名称: ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

目标类别: ['no', 'yes']

数据集前5行:
   age        job  marital    education  default housing loan    contact  \
0   56  housemaid  married     basic.4y       no      no   no  telephone   
1   57   services  married  high.school  unknown      no   no  telephone   
2   37   services  married  high.school       no     yes   no  telephone   
3   40     admin.  married     basic.6y       no      no   no  telephone   
4   56   services  married  high.school       no      no  yes  telephone   

  month day_of_week  duration  campaign  pdays  previous     poutcome  \
0   may         mon       261         1    999         0  nonexistent   
1   may         mon       149         1    999         0  nonexistent   
2   may         mon       226         1    999         0  nonexistent   
3   may         mon       151         1    999         0  nonexistent   
4   may         mon       307         1    999         0  nonexistent   

   emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  nr.employed  
0           1.1          93.994          -36.4      4.857       5191.0  
1           1.1          93.994          -36.4      4.857       5191.0  
2           1.1          93.994          -36.4      4.857       5191.0  
3           1.1          93.994          -36.4      4.857       5191.0  
4           1.1          93.994          -36.4      4.857       5191.0  
No description has been provided for this image
/Users/LeeChung/anaconda3/envs/ml_env/lib/python3.8/site-packages/seaborn/axisgrid.py:123: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
<Figure size 1000x600 with 0 Axes>
No description has been provided for this image

划分数据集¶

In [31]:
# 在创建随机森林之前添加数据预处理代码
from sklearn.preprocessing import LabelEncoder
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE  # 改用SMOTE过采样


# 对分类特征进行编码
categorical_columns = X.select_dtypes(include=['object']).columns
label_encoders = {}

for column in categorical_columns:
    label_encoders[column] = LabelEncoder()
    X[column] = label_encoders[column].fit_transform(X[column])

# 确保y也是数值型
if isinstance(y, pd.DataFrame):
    y = y.iloc[:,0]  # 取第一列
if y.dtype == 'object':
    y = LabelEncoder().fit_transform(y)

# 将y转换为numpy数组
y = np.array(y)

# 划分训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2,      # 测试集占20%
    random_state=42,    # 设置随机种子,确保结果可复现
    stratify=y          # 保持分层抽样,确保训练集和测试集中类别比例一致
)

# # 创建欠采样器
# rus = RandomUnderSampler(random_state=42)

# # 对训练集进行欠采样
# X_train, y_train = rus.fit_resample(X_train, y_train)

# # 创建SMOTE过采样器
# smote = SMOTE(random_state=42)

# # 对训练集进行过采样
# X_train, y_train = smote.fit_resample(X_train, y_train)

# 打印数据集大小
print("训练集大小:", X_train.shape)
print("测试集大小:", X_test.shape)

# 检查类别分布是否均衡
print("\n训练集类别分布:")
print(pd.Series(y_train).value_counts())
print("\n测试集类别分布:")
print(pd.Series(y_test).value_counts())

# 可视化训练集和测试集的类别分布
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4),dpi = 300)

# 使用Series进行可视化
sns.countplot(x=pd.Series(y_train), ax=ax1)
ax1.set_title('Train Set Class Distribution')

sns.countplot(x=pd.Series(y_test), ax=ax2)
ax2.set_title('Test Set Class Distribution')

plt.tight_layout()
plt.show()
训练集大小: (32950, 20)
测试集大小: (8238, 20)

训练集类别分布:
0    29238
1     3712
Name: count, dtype: int64

测试集类别分布:
0    7310
1     928
Name: count, dtype: int64
No description has been provided for this image

基础随机森林模型¶

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# 创建随机森林分类器
rf = RandomForestClassifier(
    n_estimators=140,        # 决策树数量
    max_depth=10,            # 树的最大深度
    max_samples=0.8,        # 训练单棵树时使用的样本比例
    bootstrap=True,         # 是否使用放回抽样
    random_state=42         # 随机种子
)

# 训练模型
rf.fit(X_train, y_train)

# 在训练集和测试集上进行预测
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

# 定义类别名称
class_names = ['no', 'yes']  # 因为是二分类问题,所以手动定义类别名称

# 输出模型评估结果
print("="*50)
print("训练集评估结果:")
print("-"*50)
print("准确率:", rf.score(X_train, y_train))
print("\n分类报告:")
print(classification_report(y_train, y_train_pred, target_names=class_names))

print("\n测试集评估结果:")
print("-"*50)
print("准确率:", rf.score(X_test, y_test))
print("\n分类报告:")
print(classification_report(y_test, y_test_pred, target_names=class_names))

# ...其余代码保持不变...
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 5), dpi=300)

# 训练集混淆矩阵
cm_train = confusion_matrix(y_train, y_train_pred)
cm_train_norm = cm_train.astype('float') / cm_train.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_train_norm, annot=True, fmt='.2f', ax=ax1, cmap='Blues')
ax1.set_title('Train Set Confusion Matrix (Normalized)')
ax1.set_xlabel('Predicted Label')
ax1.set_ylabel('True Label')

# 测试集混淆矩阵
cm_test = confusion_matrix(y_test, y_test_pred)
cm_test_norm = cm_test.astype('float') / cm_test.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_test_norm, annot=True, fmt='.2f', ax=ax2, cmap='Blues')
ax2.set_title('Test Set Confusion Matrix (Normalized)')
ax2.set_xlabel('Predicted Label')
ax2.set_ylabel('True Label')

plt.tight_layout()
plt.show()

# 输出特征重要性
feature_importance = pd.DataFrame({
    'feature': X.columns.tolist(),
    'importance': rf.feature_importances_
})
feature_importance = feature_importance.sort_values('importance', ascending=False)

plt.figure(figsize=(10, 5))
sns.barplot(x='importance', y='feature', data=feature_importance)
plt.title('Feature Importance')
plt.show()
==================================================
训练集评估结果:
--------------------------------------------------
准确率: 0.9407587253414263

分类报告:
              precision    recall  f1-score   support

          no       0.95      0.99      0.97     29238
         yes       0.86      0.57      0.68      3712

    accuracy                           0.94     32950
   macro avg       0.90      0.78      0.83     32950
weighted avg       0.94      0.94      0.94     32950


测试集评估结果:
--------------------------------------------------
准确率: 0.9214615197863559

分类报告:
              precision    recall  f1-score   support

          no       0.93      0.98      0.96      7310
         yes       0.74      0.46      0.57       928

    accuracy                           0.92      8238
   macro avg       0.84      0.72      0.76      8238
weighted avg       0.91      0.92      0.91      8238

No description has been provided for this image
No description has been provided for this image

网格搜索最佳参数¶

In [23]:
# 定义两个维度的参数范围
sample_ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
n_estimators_range = [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]

# 创建结果字典
results = {
    'sample_ratio': [],
    'n_estimators': [],
    'test_accuracy': []
}

# 为每个类别添加指标
class_names = ['no', 'yes']
for class_name in class_names:
    results[f'test_precision_{class_name}'] = []
    results[f'test_recall_{class_name}'] = []
    results[f'test_f1_{class_name}'] = []

# 进行网格搜索
for ratio in sample_ratios:
    for n_trees in n_estimators_range:
        rf = RandomForestClassifier(
            n_estimators=n_trees,
            max_samples=ratio,
            max_depth=10,
            random_state=42,
            bootstrap=True
        )
        
        # 训练模型
        rf.fit(X_train, y_train)
        
        # 预测
        y_test_pred = rf.predict(X_test)
        
        # 计算指标
        test_report = classification_report(y_test, y_test_pred, 
                                          target_names=class_names, 
                                          output_dict=True)
        
        # 存储结果
        results['sample_ratio'].append(ratio)
        results['n_estimators'].append(n_trees)
        results['test_accuracy'].append(rf.score(X_test, y_test))
        
        # 存储每个类别的指标
        for class_name in class_names:
            results[f'test_precision_{class_name}'].append(test_report[class_name]['precision'])
            results[f'test_recall_{class_name}'].append(test_report[class_name]['recall'])
            results[f'test_f1_{class_name}'].append(test_report[class_name]['f1-score'])

# ...现有代码保持不变直到 DataFrame 转换部分...

# 转换为DataFrame并创建热力图
results_df = pd.DataFrame(results)

# 创建热力图
plt.figure(figsize=(11*1.3, 8*1.3), dpi=300)

# 准确率热力图
accuracy_pivot = pd.pivot_table(
    data=results_df,
    values='test_accuracy',
    index='sample_ratio',
    columns='n_estimators'
)

plt.subplot(2, 2, 1)
sns.heatmap(accuracy_pivot, annot=True, fmt='.3f', cmap='YlOrRd')
plt.title('Test Accuracy')
plt.xlabel('Number of Trees')
plt.ylabel('Sample Ratio')

# 对每个类别绘制F1-score热力图
for i, class_name in enumerate(class_names, 2):
    f1_pivot = pd.pivot_table(
        data=results_df,
        values=f'test_f1_{class_name}',
        index='sample_ratio',
        columns='n_estimators'
    )
    plt.subplot(2, 2, i)
    sns.heatmap(f1_pivot, annot=True, fmt='.3f', cmap='YlOrRd')
    plt.title(f'F1-score (Class: {class_name})')
    plt.xlabel('Number of Trees')
    plt.ylabel('Sample Ratio')

plt.tight_layout()
plt.show()

# ...其余代码保持不变...

# 找出最佳参数组合
best_idx = results_df['test_accuracy'].idxmax()
best_params = results_df.loc[best_idx]
print("\n最佳参数组合:")
print(f"Sample Ratio: {best_params['sample_ratio']}")
print(f"Number of Trees: {best_params['n_estimators']}")
print(f"Test Accuracy: {best_params['test_accuracy']:.4f}")

print("\n每个类别在最佳参数下的性能:")
for class_name in class_names:
    print(f"\n{class_name}:")
    print(f"Precision: {best_params[f'test_precision_{class_name}']:.4f}")
    print(f"Recall: {best_params[f'test_recall_{class_name}']:.4f}")
    print(f"F1-score: {best_params[f'test_f1_{class_name}']:.4f}")
No description has been provided for this image
最佳参数组合:
Sample Ratio: 0.8
Number of Trees: 140.0
Test Accuracy: 0.9215

每个类别在最佳参数下的性能:

no:
Precision: 0.9349
Recall: 0.9798
F1-score: 0.9568

yes:
Precision: 0.7435
Recall: 0.4623
F1-score: 0.5701

研究不同抽样比例的影响(固定树的数量为50)¶

In [21]:
# 研究不同抽样比例的影响(固定树的数量为50)
sample_ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# 为每个类别创建结果字典
results = {
    'sample_ratio': [],
    'train_accuracy': [],
    'test_accuracy': []
}

# 为每个类别添加指标
class_names = ['no', 'yes'] 
for class_name in class_names:
    results[f'train_precision_{class_name}'] = []
    results[f'test_precision_{class_name}'] = []
    results[f'train_recall_{class_name}'] = []
    results[f'test_recall_{class_name}'] = []
    results[f'train_f1_{class_name}'] = []
    results[f'test_f1_{class_name}'] = []

# 创建一个大图用于显示所有混淆矩阵
plt.figure(figsize=(20, 4))
plt.suptitle('Confusion Matrices for Different Sample Ratios', y=1.05)

# 对每个抽样比例进行训练和测试
for idx, ratio in enumerate(sample_ratios):
    # 创建子图
    plt.subplot(2, len(sample_ratios), idx + 1)
    
    rf = RandomForestClassifier(
        n_estimators=140,    # 固定树的数量
        max_samples=ratio,  # 变化抽样比例
        max_depth=10,       # 保持与基础模型相同的树深度
        random_state=42,
        bootstrap=True
    )
    
    # 训练模型
    rf.fit(X_train, y_train)
    
    # 预测
    y_train_pred = rf.predict(X_train)
    y_test_pred = rf.predict(X_test)
    
    # 计算混淆矩阵
    cm_train = confusion_matrix(y_train, y_train_pred)
    cm_test = confusion_matrix(y_test, y_test_pred)
    
    # 绘制训练集混淆矩阵
    sns.heatmap(cm_train, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Train (ratio={ratio})')
    if idx == 0:
        plt.ylabel('True Label')
    plt.xlabel('Predicted')
    
    # 绘制测试集混淆矩阵
    plt.subplot(2, len(sample_ratios), idx + len(sample_ratios) + 1)
    sns.heatmap(cm_test, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Test (ratio={ratio})')
    if idx == 0:
        plt.ylabel('True Label')
    plt.xlabel('Predicted')
    
    # 计算各项指标
    train_report = classification_report(y_train, y_train_pred, 
                                       target_names=class_names, 
                                       output_dict=True)
    test_report = classification_report(y_test, y_test_pred, 
                                      target_names=class_names, 
                                      output_dict=True)
    
    # 存储基本结果
    results['sample_ratio'].append(ratio)
    results['train_accuracy'].append(rf.score(X_train, y_train))
    results['test_accuracy'].append(rf.score(X_test, y_test))
    
    # 存储每个类别的详细指标
    for class_name in class_names:
        results[f'train_precision_{class_name}'].append(train_report[class_name]['precision'])
        results[f'test_precision_{class_name}'].append(test_report[class_name]['precision'])
        results[f'train_recall_{class_name}'].append(train_report[class_name]['recall'])
        results[f'test_recall_{class_name}'].append(test_report[class_name]['recall'])
        results[f'train_f1_{class_name}'].append(train_report[class_name]['f1-score'])
        results[f'test_f1_{class_name}'].append(test_report[class_name]['f1-score'])

plt.tight_layout()
plt.show()

# 转换为DataFrame并绘制性能指标图
results_df = pd.DataFrame(results)

# 绘制性能指标图
plt.figure(figsize=(10, 6), dpi=300)

# Accuracy
plt.subplot(2, 2, 1)
plt.plot(results_df['sample_ratio'], results_df['test_accuracy'], 
         marker='o', label='Test Accuracy', linewidth=2, color='red')
plt.xlabel('Sample Ratio')
plt.ylabel('Accuracy')
plt.title('Test Accuracy vs Sample Ratio')
plt.grid(True)
plt.legend()

# Precision
plt.subplot(2, 2, 2)
for class_name in class_names:
    plt.plot(results_df['sample_ratio'], results_df[f'test_precision_{class_name}'], 
             marker='o', label=f'Class: {class_name}')
plt.xlabel('Sample Ratio')
plt.ylabel('Precision')
plt.title('Class-wise Test Precision vs Sample Ratio')
plt.grid(True)
plt.legend()

# Recall
plt.subplot(2, 2, 3)
for class_name in class_names:
    plt.plot(results_df['sample_ratio'], results_df[f'test_recall_{class_name}'], 
             marker='o', label=f'Class: {class_name}')
plt.xlabel('Sample Ratio')
plt.ylabel('Recall')
plt.title('Class-wise Test Recall vs Sample Ratio')
plt.grid(True)
plt.legend()

# F1-score
plt.subplot(2, 2, 4)
for class_name in class_names:
    plt.plot(results_df['sample_ratio'], results_df[f'test_f1_{class_name}'], 
             marker='o', label=f'Class: {class_name}')
plt.xlabel('Sample Ratio')
plt.ylabel('F1-score')
plt.title('Class-wise Test F1-score vs Sample Ratio')
plt.grid(True)
plt.legend()

plt.tight_layout()
plt.show()

# 打印最佳抽样比例及其性能
best_idx = results_df['test_accuracy'].idxmax()
best_ratio = results_df.loc[best_idx, 'sample_ratio']
print(f"\nBest Sample Ratio: {best_ratio}")
print(f"Test Accuracy with Best Ratio: {results_df.loc[best_idx, 'test_accuracy']:.4f}")

# 打印每个类别在最佳比例下的性能
print("\nClass-wise Test Performance at Best Ratio:")
for class_name in class_names:
    print(f"\n{class_name}:")
    print(f"Precision: {results_df.loc[best_idx, f'test_precision_{class_name}']:.4f}")
    print(f"Recall: {results_df.loc[best_idx, f'test_recall_{class_name}']:.4f}")
    print(f"F1-score: {results_df.loc[best_idx, f'test_f1_{class_name}']:.4f}")
No description has been provided for this image
No description has been provided for this image
Best Sample Ratio: 0.8
Test Accuracy with Best Ratio: 0.9215

Class-wise Test Performance at Best Ratio:

no:
Precision: 0.9349
Recall: 0.9798
F1-score: 0.9568

yes:
Precision: 0.7435
Recall: 0.4623
F1-score: 0.5701

研究不同树的数量的影响(固定抽样比例为0.6)¶

In [11]:
# 研究不同树的数量的影响(固定抽样比例为0.6)
n_estimators_range = [20,40,60,80,100,120,140,160,180,200,220,240,260,280,300]

# 为每个类别创建结果字典
results = {
    'n_estimators': [],
    'train_accuracy': [],
    'test_accuracy': []
}

class_names = ['no', 'yes'] 
# 为每个类别添加指标
for class_name in class_names:
    results[f'train_precision_{class_name}'] = []
    results[f'test_precision_{class_name}'] = []
    results[f'train_recall_{class_name}'] = []
    results[f'test_recall_{class_name}'] = []
    results[f'train_f1_{class_name}'] = []
    results[f'test_f1_{class_name}'] = []

# 对每个树的数量进行训练和测试
for n_trees in n_estimators_range:
    rf = RandomForestClassifier(
        n_estimators=n_trees,  # 变化树的数量
        max_samples=0.8,       # 固定抽样比例
        max_depth=10,           # 保持与基础模型相同的树深度
        random_state=42,
        bootstrap=True
    )
    
    # 训练模型
    rf.fit(X_train, y_train)
    
    # 预测
    y_train_pred = rf.predict(X_train)
    y_test_pred = rf.predict(X_test)
    
    # 计算各项指标
    train_report = classification_report(y_train, y_train_pred, 
                                       target_names=class_names, 
                                       output_dict=True)
    test_report = classification_report(y_test, y_test_pred, 
                                      target_names=class_names, 
                                      output_dict=True)
    
    # 存储基本结果
    results['n_estimators'].append(n_trees)
    results['train_accuracy'].append(rf.score(X_train, y_train))
    results['test_accuracy'].append(rf.score(X_test, y_test))
    
    # 存储每个类别的详细指标
    for class_name in class_names:
        results[f'train_precision_{class_name}'].append(train_report[class_name]['precision'])
        results[f'test_precision_{class_name}'].append(test_report[class_name]['precision'])
        results[f'train_recall_{class_name}'].append(train_report[class_name]['recall'])
        results[f'test_recall_{class_name}'].append(test_report[class_name]['recall'])
        results[f'train_f1_{class_name}'].append(train_report[class_name]['f1-score'])
        results[f'test_f1_{class_name}'].append(test_report[class_name]['f1-score'])

# 转换为DataFrame
results_df = pd.DataFrame(results)

# 绘制结果
plt.figure(figsize=(10, 6), dpi=300)

# Accuracy
plt.subplot(2, 2, 1)
plt.plot(results_df['n_estimators'], results_df['test_accuracy'], 
         marker='o', label='Test Accuracy', linewidth=2, color='red')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.title('Test Accuracy vs Number of Trees')
plt.grid(True)
plt.legend()

# Precision
plt.subplot(2, 2, 2)
for class_name in class_names:
    plt.plot(results_df['n_estimators'], results_df[f'test_precision_{class_name}'], 
             marker='o', label=f'Class: {class_name}')
plt.xlabel('Number of Trees')
plt.ylabel('Precision')
plt.title('Class-wise Test Precision vs Number of Trees')
plt.grid(True)
plt.legend()

# Recall
plt.subplot(2, 2, 3)
for class_name in class_names:
    plt.plot(results_df['n_estimators'], results_df[f'test_recall_{class_name}'], 
             marker='o', label=f'Class: {class_name}')
plt.xlabel('Number of Trees')
plt.ylabel('Recall')
plt.title('Class-wise Test Recall vs Number of Trees')
plt.grid(True)
plt.legend()

# F1-score
plt.subplot(2, 2, 4)
for class_name in class_names:
    plt.plot(results_df['n_estimators'], results_df[f'test_f1_{class_name}'], 
             marker='o', label=f'Class: {class_name}')
plt.xlabel('Number of Trees')
plt.ylabel('F1-score')
plt.title('Class-wise Test F1-score vs Number of Trees')
plt.grid(True)
plt.legend()

plt.tight_layout()
plt.show()

# 打印最佳树的数量及其性能
best_idx = results_df['test_accuracy'].idxmax()
best_n_trees = results_df.loc[best_idx, 'n_estimators']
print(f"\nBest Number of Trees: {best_n_trees}")
print(f"Test Accuracy with Best Trees: {results_df.loc[best_idx, 'test_accuracy']:.4f}")

# 打印每个类别在最佳树数量下的性能
print("\nClass-wise Test Performance at Best Number of Trees:")
for class_name in class_names:
    print(f"\n{class_name}:")
    print(f"Precision: {results_df.loc[best_idx, f'test_precision_{class_name}']:.4f}")
    print(f"Recall: {results_df.loc[best_idx, f'test_recall_{class_name}']:.4f}")
    print(f"F1-score: {results_df.loc[best_idx, f'test_f1_{class_name}']:.4f}")
No description has been provided for this image
Best Number of Trees: 140
Test Accuracy with Best Trees: 0.9215

Class-wise Test Performance at Best Number of Trees:

no:
Precision: 0.9349
Recall: 0.9798
F1-score: 0.9568

yes:
Precision: 0.7435
Recall: 0.4623
F1-score: 0.5701

单颗树情形vs随机森林¶

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

# 假设已经有 X_train, y_train, X_test, y_test
# 这里需要确保数据已经正确加载和划分
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 创建不同深度的决策树和随机森林
max_depths = [3, 5, 10, 15, 20, 30, None]  # None表示不限制深度
results = {
    'max_depth': [],
    'dt_train_accuracy': [],
    'dt_test_accuracy': [],
    'rf_train_accuracy': [],
    'rf_test_accuracy': [],
    'dt_model': [],  # 存储训练好的决策树模型
    'rf_model': [],  # 存储训练好的随机森林模型
    'dt_train_precision': [],
    'dt_test_precision': [],
    'dt_train_recall': [],
    'dt_test_recall': [],
    'dt_train_f1': [],
    'dt_test_f1': [],
    'rf_train_precision': [],
    'rf_test_precision': [],
    'rf_train_recall': [],
    'rf_test_recall': [],
    'rf_train_f1': [],
    'rf_test_f1': []
}

# 测试不同深度的决策树和随机森林
for depth in max_depths:
    # 存储当前深度
    results['max_depth'].append(str(depth))

    # 决策树
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)  # 只训练一次

    # 在训练集和测试集上预测
    y_train_pred_dt = dt.predict(X_train)
    y_test_pred_dt = dt.predict(X_test)

    # 在训练集和测试集上评估准确率
    dt_train_accuracy = dt.score(X_train, y_train)
    dt_test_accuracy = dt.score(X_test, y_test)

    # 计算决策树在训练集和测试集的精确率、召回率和F1Score
    dt_train_precision = precision_score(y_train, y_train_pred_dt, average='weighted')
    dt_test_precision = precision_score(y_test, y_test_pred_dt, average='weighted')
    dt_train_recall = recall_score(y_train, y_train_pred_dt, average='weighted')
    dt_test_recall = recall_score(y_test, y_test_pred_dt, average='weighted')
    dt_train_f1 = f1_score(y_train, y_train_pred_dt, average='weighted')
    dt_test_f1 = f1_score(y_test, y_test_pred_dt, average='weighted')

    # 存储决策树结果
    results['dt_train_accuracy'].append(dt_train_accuracy)
    results['dt_test_accuracy'].append(dt_test_accuracy)
    results['dt_model'].append(dt)
    results['dt_train_precision'].append(dt_train_precision)
    results['dt_test_precision'].append(dt_test_precision)
    results['dt_train_recall'].append(dt_train_recall)
    results['dt_test_recall'].append(dt_test_recall)
    results['dt_train_f1'].append(dt_train_f1)
    results['dt_test_f1'].append(dt_test_f1)

    # 随机森林
    rf = RandomForestClassifier(
        n_estimators=100,
        max_depth=depth,
        random_state=42
    )
    rf.fit(X_train, y_train)  # 只训练一次

    # 在训练集和测试集上预测
    y_train_pred_rf = rf.predict(X_train)
    y_test_pred_rf = rf.predict(X_test)

    # 在训练集和测试集上评估准确率
    rf_train_accuracy = rf.score(X_train, y_train)
    rf_test_accuracy = rf.score(X_test, y_test)

    # 计算随机森林在训练集和测试集的精确率、召回率和F1Score
    rf_train_precision = precision_score(y_train, y_train_pred_rf, average='weighted')
    rf_test_precision = precision_score(y_test, y_test_pred_rf, average='weighted')
    rf_train_recall = recall_score(y_train, y_train_pred_rf, average='weighted')
    rf_test_recall = recall_score(y_test, y_test_pred_rf, average='weighted')
    rf_train_f1 = f1_score(y_train, y_train_pred_rf, average='weighted')
    rf_test_f1 = f1_score(y_test, y_test_pred_rf, average='weighted')

    # 存储随机森林结果
    results['rf_train_accuracy'].append(rf_train_accuracy)
    results['rf_test_accuracy'].append(rf_test_accuracy)
    results['rf_model'].append(rf)
    results['rf_train_precision'].append(rf_train_precision)
    results['rf_test_precision'].append(rf_test_precision)
    results['rf_train_recall'].append(rf_train_recall)
    results['rf_test_recall'].append(rf_test_recall)
    results['rf_train_f1'].append(rf_train_f1)
    results['rf_test_f1'].append(rf_test_f1)

# 可视化每个指标
metrics = [
    ('Accuracy', 'dt_train_accuracy', 'dt_test_accuracy', 'rf_train_accuracy', 'rf_test_accuracy'),
    ('Precision', 'dt_train_precision', 'dt_test_precision', 'rf_train_precision', 'rf_test_precision'),
    ('Recall', 'dt_train_recall', 'dt_test_recall', 'rf_train_recall', 'rf_test_recall'),
    ('F1 Score', 'dt_train_f1', 'dt_test_f1', 'rf_train_f1', 'rf_test_f1')
]

for metric_name, *metric_keys in metrics:
    plt.figure(figsize=(10, 5), dpi=300)
    plt.plot(results['max_depth'], results[metric_keys[0]], 'o-', label='Decision Tree Train', color='blue')
    plt.plot(results['max_depth'], results[metric_keys[1]], 'o--', label='Decision Tree Test', color='lightblue')
    plt.plot(results['max_depth'], results[metric_keys[2]], 's-', label='Random Forest Train', color='red')
    plt.plot(results['max_depth'], results[metric_keys[3]], 's--', label='Random Forest Test', color='lightcoral')
    plt.title(f'{metric_name} Comparison')
    plt.xlabel('Max Depth')
    plt.ylabel(metric_name)
    plt.legend()
    plt.grid(True)
    plt.show()


# 打印详细结果并分析过拟合
print("\n不同深度下的模型表现比较:")
for i, depth in enumerate(max_depths):
    print(f"\n最大深度: {depth}")
    
    # 获取当前深度下的模型
    dt = results['dt_model'][i]
    rf = results['rf_model'][i]
    
    print("决策树:")
    print(f"训练集准确率: {results['dt_train_accuracy'][i]:.4f}")
    print(f"测试集准确率: {results['dt_test_accuracy'][i]:.4f}")
    print(f"过拟合程度: {results['dt_train_accuracy'][i] - results['dt_test_accuracy'][i]:.4f}")
    
    # 输出详细的分类报告
    y_train_pred_dt = dt.predict(X_train)
    y_test_pred_dt = dt.predict(X_test)
    print("\n决策树训练集分类报告:")
    print(classification_report(y_train, y_train_pred_dt))
    print("\n决策树测试集分类报告:")
    print(classification_report(y_test, y_test_pred_dt))
    
    print("\n随机森林:")
    print(f"训练集准确率: {results['rf_train_accuracy'][i]:.4f}")
    print(f"测试集准确率: {results['rf_test_accuracy'][i]:.4f}")
    print(f"过拟合程度: {results['rf_train_accuracy'][i] - results['rf_test_accuracy'][i]:.4f}")
    
    # 输出详细的分类报告
    y_train_pred_rf = rf.predict(X_train)
    y_test_pred_rf = rf.predict(X_test)
    print("\n随机森林训练集分类报告:")
    print(classification_report(y_train, y_train_pred_rf))
    print("\n随机森林测试集分类报告:")
    print(classification_report(y_test, y_test_pred_rf))
    
    print("-"*50)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
不同深度下的模型表现比较:

最大深度: 3
决策树:
训练集准确率: 0.9082
测试集准确率: 0.9133
过拟合程度: -0.0051

决策树训练集分类报告:
              precision    recall  f1-score   support

           0       0.95      0.95      0.95     29238
           1       0.60      0.58      0.59      3712

    accuracy                           0.91     32950
   macro avg       0.77      0.76      0.77     32950
weighted avg       0.91      0.91      0.91     32950


决策树测试集分类报告:
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      7310
           1       0.62      0.59      0.61       928

    accuracy                           0.91      8238
   macro avg       0.78      0.77      0.78      8238
weighted avg       0.91      0.91      0.91      8238


随机森林:
训练集准确率: 0.9014
测试集准确率: 0.9019
过拟合程度: -0.0005

随机森林训练集分类报告:
              precision    recall  f1-score   support

           0       0.90      0.99      0.95     29238
           1       0.81      0.16      0.27      3712

    accuracy                           0.90     32950
   macro avg       0.85      0.58      0.61     32950
weighted avg       0.89      0.90      0.87     32950


随机森林测试集分类报告:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95      7310
           1       0.85      0.16      0.27       928

    accuracy                           0.90      8238
   macro avg       0.88      0.58      0.61      8238
weighted avg       0.90      0.90      0.87      8238

--------------------------------------------------

最大深度: 5
决策树:
训练集准确率: 0.9163
测试集准确率: 0.9181
过拟合程度: -0.0018

决策树训练集分类报告:
              precision    recall  f1-score   support

           0       0.94      0.97      0.95     29238
           1       0.66      0.53      0.59      3712

    accuracy                           0.92     32950
   macro avg       0.80      0.75      0.77     32950
weighted avg       0.91      0.92      0.91     32950


决策树测试集分类报告:
              precision    recall  f1-score   support

           0       0.94      0.97      0.95      7310
           1       0.68      0.52      0.59       928

    accuracy                           0.92      8238
   macro avg       0.81      0.74      0.77      8238
weighted avg       0.91      0.92      0.91      8238


随机森林:
训练集准确率: 0.9061
测试集准确率: 0.9070
过拟合程度: -0.0009

随机森林训练集分类报告:
              precision    recall  f1-score   support

           0       0.91      0.99      0.95     29238
           1       0.80      0.22      0.35      3712

    accuracy                           0.91     32950
   macro avg       0.85      0.61      0.65     32950
weighted avg       0.90      0.91      0.88     32950


随机森林测试集分类报告:
              precision    recall  f1-score   support

           0       0.91      0.99      0.95      7310
           1       0.83      0.22      0.35       928

    accuracy                           0.91      8238
   macro avg       0.87      0.61      0.65      8238
weighted avg       0.90      0.91      0.88      8238

--------------------------------------------------

最大深度: 10
决策树:
训练集准确率: 0.9385
测试集准确率: 0.9147
过拟合程度: 0.0239

决策树训练集分类报告:
              precision    recall  f1-score   support

           0       0.96      0.97      0.97     29238
           1       0.76      0.67      0.71      3712

    accuracy                           0.94     32950
   macro avg       0.86      0.82      0.84     32950
weighted avg       0.94      0.94      0.94     32950


决策树测试集分类报告:
              precision    recall  f1-score   support

           0       0.95      0.96      0.95      7310
           1       0.63      0.57      0.60       928

    accuracy                           0.91      8238
   macro avg       0.79      0.76      0.78      8238
weighted avg       0.91      0.91      0.91      8238


随机森林:
训练集准确率: 0.9403
测试集准确率: 0.9204
过拟合程度: 0.0199

随机森林训练集分类报告:
              precision    recall  f1-score   support

           0       0.94      0.99      0.97     29238
           1       0.88      0.54      0.67      3712

    accuracy                           0.94     32950
   macro avg       0.91      0.77      0.82     32950
weighted avg       0.94      0.94      0.93     32950


随机森林测试集分类报告:
              precision    recall  f1-score   support

           0       0.93      0.98      0.96      7310
           1       0.76      0.43      0.55       928

    accuracy                           0.92      8238
   macro avg       0.84      0.71      0.75      8238
weighted avg       0.91      0.92      0.91      8238

--------------------------------------------------

最大深度: 15
决策树:
训练集准确率: 0.9755
测试集准确率: 0.9045
过拟合程度: 0.0710

决策树训练集分类报告:
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     29238
           1       0.91      0.87      0.89      3712

    accuracy                           0.98     32950
   macro avg       0.95      0.93      0.94     32950
weighted avg       0.98      0.98      0.98     32950


决策树测试集分类报告:
              precision    recall  f1-score   support

           0       0.95      0.95      0.95      7310
           1       0.58      0.57      0.58       928

    accuracy                           0.90      8238
   macro avg       0.76      0.76      0.76      8238
weighted avg       0.90      0.90      0.90      8238


随机森林:
训练集准确率: 0.9818
测试集准确率: 0.9221
过拟合程度: 0.0597

随机森林训练集分类报告:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     29238
           1       0.99      0.85      0.91      3712

    accuracy                           0.98     32950
   macro avg       0.99      0.92      0.95     32950
weighted avg       0.98      0.98      0.98     32950


随机森林测试集分类报告:
              precision    recall  f1-score   support

           0       0.94      0.97      0.96      7310
           1       0.70      0.53      0.61       928

    accuracy                           0.92      8238
   macro avg       0.82      0.75      0.78      8238
weighted avg       0.92      0.92      0.92      8238

--------------------------------------------------

最大深度: 20
决策树:
训练集准确率: 0.9975
测试集准确率: 0.8972
过拟合程度: 0.1003

决策树训练集分类报告:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     29238
           1       0.99      0.99      0.99      3712

    accuracy                           1.00     32950
   macro avg       0.99      0.99      0.99     32950
weighted avg       1.00      1.00      1.00     32950


决策树测试集分类报告:
              precision    recall  f1-score   support

           0       0.95      0.94      0.94      7310
           1       0.54      0.58      0.56       928

    accuracy                           0.90      8238
   macro avg       0.74      0.76      0.75      8238
weighted avg       0.90      0.90      0.90      8238


随机森林:
训练集准确率: 0.9988
测试集准确率: 0.9221
过拟合程度: 0.0767

随机森林训练集分类报告:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     29238
           1       1.00      0.99      0.99      3712

    accuracy                           1.00     32950
   macro avg       1.00      0.99      1.00     32950
weighted avg       1.00      1.00      1.00     32950


随机森林测试集分类报告:
              precision    recall  f1-score   support

           0       0.94      0.97      0.96      7310
           1       0.70      0.54      0.61       928

    accuracy                           0.92      8238
   macro avg       0.82      0.75      0.78      8238
weighted avg       0.92      0.92      0.92      8238

--------------------------------------------------

最大深度: 30
决策树:
训练集准确率: 1.0000
测试集准确率: 0.8956
过拟合程度: 0.1044

决策树训练集分类报告:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     29238
           1       1.00      1.00      1.00      3712

    accuracy                           1.00     32950
   macro avg       1.00      1.00      1.00     32950
weighted avg       1.00      1.00      1.00     32950


决策树测试集分类报告:
              precision    recall  f1-score   support

           0       0.94      0.94      0.94      7310
           1       0.53      0.57      0.55       928

    accuracy                           0.90      8238
   macro avg       0.74      0.75      0.75      8238
weighted avg       0.90      0.90      0.90      8238


随机森林:
训练集准确率: 1.0000
测试集准确率: 0.9205
过拟合程度: 0.0795

随机森林训练集分类报告:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     29238
           1       1.00      1.00      1.00      3712

    accuracy                           1.00     32950
   macro avg       1.00      1.00      1.00     32950
weighted avg       1.00      1.00      1.00     32950


随机森林测试集分类报告:
              precision    recall  f1-score   support

           0       0.94      0.97      0.96      7310
           1       0.69      0.54      0.60       928

    accuracy                           0.92      8238
   macro avg       0.82      0.75      0.78      8238
weighted avg       0.91      0.92      0.92      8238

--------------------------------------------------

最大深度: None
决策树:
训练集准确率: 1.0000
测试集准确率: 0.8956
过拟合程度: 0.1044

决策树训练集分类报告:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     29238
           1       1.00      1.00      1.00      3712

    accuracy                           1.00     32950
   macro avg       1.00      1.00      1.00     32950
weighted avg       1.00      1.00      1.00     32950


决策树测试集分类报告:
              precision    recall  f1-score   support

           0       0.94      0.94      0.94      7310
           1       0.53      0.57      0.55       928

    accuracy                           0.90      8238
   macro avg       0.74      0.75      0.75      8238
weighted avg       0.90      0.90      0.90      8238


随机森林:
训练集准确率: 1.0000
测试集准确率: 0.9210
过拟合程度: 0.0790

随机森林训练集分类报告:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     29238
           1       1.00      1.00      1.00      3712

    accuracy                           1.00     32950
   macro avg       1.00      1.00      1.00     32950
weighted avg       1.00      1.00      1.00     32950


随机森林测试集分类报告:
              precision    recall  f1-score   support

           0       0.94      0.97      0.96      7310
           1       0.69      0.54      0.61       928

    accuracy                           0.92      8238
   macro avg       0.82      0.76      0.78      8238
weighted avg       0.91      0.92      0.92      8238

--------------------------------------------------

重复树¶

In [26]:
from sklearn.tree import export_text
import hashlib
# ... existing code ...

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# 定义树的数量,从 5 到 1000 指数增长
n_estimators_values = np.unique(np.geomspace(5, 500, 10).astype(int))
seeds = np.arange(5)

ratios = []

for n_estimators in n_estimators_values:
    unique_tree_counts = []
    for seed in seeds:
        # 创建随机森林模型,抽样比例为 0.8
        rf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_samples=0.8,
            random_state=seed
        )
        rf.fit(X_train, y_train)

        # 检测重复树
        duplicates, unique_trees = check_duplicate_trees(rf)
        unique_tree_counts.append(unique_trees)

    # 计算平均不重复树的比例
    avg_unique = np.mean(unique_tree_counts)
    ratio = avg_unique / n_estimators
    ratios.append(ratio)

# 绘制折线图
plt.figure(figsize=(10, 5))
plt.plot(n_estimators_values, ratios, marker='o', linestyle='-')
plt.xlabel('Number of Trees')
plt.ylabel('Unique Trees / Total Trees')
plt.title('Ratio of Unique Trees to Total Trees with Different Number of Trees')
plt.grid(True)
plt.xscale('log')  # 使用对数刻度以便更好展示指数增长
plt.show()

# 绘制柱状图
plt.figure(figsize=(10, 5))
plt.bar(n_estimators_values, ratios, width=[i * 0.2 for i in n_estimators_values])
plt.xlabel('Number of Trees')
plt.ylabel('Unique Trees / Total Trees')
plt.title('Ratio of Unique Trees to Total Trees with Different Number of Trees (Bar Chart)')
plt.grid(True)
plt.xscale('log')  # 使用对数刻度以便更好展示指数增长
plt.show()

# ... existing code ...
def get_tree_structure(tree):
    """获取树的结构并返回其哈希值"""
    tree_str = export_text(tree, feature_names=list(X.columns))
    return hashlib.md5(tree_str.encode()).hexdigest()

def check_duplicate_trees(rf_model):
    """检查随机森林中的重复树"""
    tree_structures = {}
    duplicates = []
    
    # 遍历所有树
    for idx, tree in enumerate(rf_model.estimators_):
        tree_hash = get_tree_structure(tree)
        
        # 如果这个结构已经存在,说明找到了重复树
        if tree_hash in tree_structures:
            duplicates.append((idx, tree_structures[tree_hash]))
        else:
            tree_structures[tree_hash] = idx
    
    return duplicates, len(tree_structures)

# 创建一个随机森林模型进行测试
rf = RandomForestClassifier(
    n_estimators=100,  
    max_depth=5,      
    random_state=42
)
rf.fit(X_train, y_train)

# 检测重复树
duplicates, unique_trees = check_duplicate_trees(rf)

# 输出结果
print(f"总树数: {len(rf.estimators_)}")
print(f"唯一树数: {unique_trees}")
print(f"重复树数: {len(rf.estimators_) - unique_trees}")
print("\n重复树的索引对:")
for dup in duplicates:
    print(f"树 {dup[0]} 与树 {dup[1]} 结构相同")

# 可视化重复树比例
plt.figure(figsize=(6, 6))
plt.pie([unique_trees, len(rf.estimators_) - unique_trees], 
        labels=['Unique Trees', 'Duplicate Trees'],
        autopct='%1.1f%%',
        colors=['lightblue', 'lightcoral'])
plt.title('Proportion of Unique vs Duplicate Trees')
plt.show()

# 研究不同参数对重复树的影响
param_combinations = [
    {'n_estimators': 50, 'max_depth': 3, 'max_features': 'sqrt'},
    {'n_estimators': 50, 'max_depth': 5, 'max_features': 'sqrt'},
    {'n_estimators': 50, 'max_depth': 10, 'max_features': 'sqrt'},
    {'n_estimators': 50, 'max_depth': 3, 'max_features': None},
    {'n_estimators': 50, 'max_depth': 5, 'max_features': None},
    {'n_estimators': 50, 'max_depth': 10, 'max_features': None},
]

results = []
for params in param_combinations:
    rf = RandomForestClassifier(**params, random_state=42)
    rf.fit(X_train, y_train)
    duplicates, unique_trees = check_duplicate_trees(rf)
    
    results.append({
        'max_depth': params['max_depth'],
        'max_features': params['max_features'],
        'unique_trees': unique_trees,
        'duplicate_trees': params['n_estimators'] - unique_trees
    })

# 将结果转换为DataFrame并可视化
results_df = pd.DataFrame(results)
plt.figure(figsize=(10, 5))

x = np.arange(len(results_df))
width = 0.35

plt.bar(x - width/2, results_df['unique_trees'], width, label='Unique Trees')
plt.bar(x + width/2, results_df['duplicate_trees'], width, label='Duplicate Trees')

plt.xlabel('Parameter Combinations')
plt.ylabel('Number of Trees')
plt.title('Effect of Parameters on Tree Uniqueness')
plt.xticks(x, [f'depth={d}\n{"all" if f is None else "sqrt"} features' 
               for d, f in zip(results_df['max_depth'], results_df['max_features'])])
plt.legend()
plt.tight_layout()
plt.show()

# 输出详细的参数影响分析
print("\n参数对树重复性的影响分析:")
for idx, row in results_df.iterrows():
    print(f"\nmax_depth={row['max_depth']}, max_features={'all' if row['max_features'] is None else 'sqrt'}:")
    print(f"唯一树数: {row['unique_trees']}")
    print(f"重复树数: {row['duplicate_trees']}")
    print(f"重复树比例: {row['duplicate_trees']/50*100:.1f}%")
No description has been provided for this image
No description has been provided for this image
总树数: 100
唯一树数: 100
重复树数: 0

重复树的索引对:
No description has been provided for this image
No description has been provided for this image
参数对树重复性的影响分析:

max_depth=3, max_features=sqrt:
唯一树数: 50
重复树数: 0
重复树比例: 0.0%

max_depth=5, max_features=sqrt:
唯一树数: 50
重复树数: 0
重复树比例: 0.0%

max_depth=10, max_features=sqrt:
唯一树数: 50
重复树数: 0
重复树比例: 0.0%

max_depth=3, max_features=all:
唯一树数: 50
重复树数: 0
重复树比例: 0.0%

max_depth=5, max_features=all:
唯一树数: 50
重复树数: 0
重复树比例: 0.0%

max_depth=10, max_features=all:
唯一树数: 50
重复树数: 0
重复树比例: 0.0%